A dataset of mentorship in science with semantic and demographic estimations

Mentorship in science is crucial for topic choice, career decisions, and the success of mentees and mentors. Typically, researchers who study mentorship use article co-authorship and doctoral dissertation datasets. However, available datasets of this type focus on narrow selections of fields and miss out on early career and non-publication-related interactions. Here, we describe MENTORSHIP, a crowdsourced dataset of 743176 mentorship relationships among 738989 scientists across 112 fields that avoids these shortcomings. We enrich the scientists' profiles with publication data from the Microsoft Academic Graph and "semantic" representations of research using deep learning content analysis. Because gender and race have become critical dimensions when analyzing mentorship and disparities in science, we also provide estimations of these factors. We perform extensive validations of the profile-publication matching, semantic content, and demographic inferences. We anticipate this dataset will spur the study of mentorship in science and deepen our understanding of its role in scientists' career outcomes.


Background & Summary
Mentorship is a form of guidance provided by a more experienced person (mentor) to a less seasoned one (mentee). In science, mentors draw from their experiences to help mentees, who are often early-career researchers, navigate various issues inside and outside of academia. Mentorship is a crucial phase in a scientist's development that has long-term effects throughout her career. Mentorship can occur formally through doctoral and postdoctoral advisor-advisee relationships or informally through collaborations. Mentees not only learn new knowledge and skills from mentors but also get involved in mentors' social connections 1. Numerous studies have pointed out the association between a mentor's characteristics and a mentee's academic success, such as productivity 2-4, career preference and placement 2,5,6, mentorship fecundity 7,8, and impact 9. Despite the large role of mentorship and the interest in studying it, previous studies have relied on single-field datasets and indirect signals of mentorship (e.g., co-authorship) and therefore have limited generalizability. Large, curated, and open datasets on mentorship have the potential to significantly benefit our understanding of the phenomenon, similar to how citation and publication datasets have accelerated the emerging field of science of science 10,11.
Studying mentorship requires access to a broad set of relationship types, including publication. There are a few data sources for mentorship in science (Table 1); here, we list a handful of them. The Mathematics Genealogy Project (MGP) 12 is an online database for academic genealogy only in mathematics, though more broadly construed to include "mathematics education, statistics, computer science, or operations research". MGP lacks publication records. The Astronomy Genealogy Project is a similar online database confined to astronomy that also does not have publication information 13,14 . ProQuest is a database of theses and dissertations predominantly from the US 15 . Although it is multi-disciplinary, it does not disambiguate researchers, making it hard to link advisor and advisee and construct lineages. Also, it does not provide publication information. More importantly, ProQuest is not publicly available, and its access is rate-limited. Apart from genealogy and thesis data, other researchers have proposed to use paper co-authorships as indirect signals of mentorship 16 . However, mentorship can start much earlier than publishing works, and it does not necessarily lead to publications 17 . To summarize, datasets about mentorship in science are in general fragmented.
Here, we start from the Academic Family Tree (AFT) website 18 and extend it to create a large-scale dataset of mentorship relationships in science. The AFT is an online portal for mentorship in science. We match each AFT profile to the Microsoft Academic Graph (MAG), a leading bibliographic database 19 . Moreover, we apply natural language processing techniques to extract semantic representations of researchers based on deep learning content analysis of their publications. Given the recent interest to understand the role of gender and race/ethnicity in science 20 , we also provide estimations of researchers' demographics. Compared to existing databases, our dataset, MENTORSHIP (MENTORship with Semantic, Hierarchical, and demographIc Patterns), covers a wide range of disciplines with a richer set of features, making it ideal for studying generalizable mentorship patterns. We expect it to be the base of future studies covering various aspects of scientific mentorship, including semantic and demographic factors.

Data sources
The AFT website displays researchers' profile information, such as direct academic parents and children and a limited set of publication records in PubMed. Originally focused on neuroscience 21, AFT has been expanding to other areas such as chemistry, engineering, and education. Because AFT is a crowdsourced website, its content is contributed by registered users. Contributions can be diverse, from adding a new researcher to adding mentors, trainees, and collaborators of an existing researcher. Visitors can also indicate whether the website has correctly matched a profile with a publication. Because of this crowdsourced nature, researchers on AFT may not be a representative sample of the academic population.
In AFT, the user-contributed data are stored in a database consisting of several tables that are available online 22 . These tables are the starting point for the present work. In particular, we use four tables: (1) the people table storing researchers' basic information, including person's ID, name, degree, research area, etc.; (2) the connect table detailing mentorship relationships, including its ID, mentee and mentor person IDs, mentorship type (e.g., PhD, postdoctoral advising), and when and where the mentorship occurred; (3) the authorPub table enumerating researchers and their papers as well as meta data of papers; and (4) the locations table listing institutions and their geolocations.
We use the MAG dataset to find papers of AFT researchers. MAG contains information about papers, authors, journals, conferences, affiliations, and citations. One advantage of MAG is that all entities have been disambiguated and associated with identifiers. This dataset has been used in several recent works for author-and venue-level analyses 20,23 . Here we use four tables in MAG: (1) the Affiliations table that lists institution related information; (2) the PaperAuthorAffiliations table that records the name and the affiliation of each authorship; (3) the Authors table that contains author information including names; and (4) the Papers table that consists of paper-related metadata such as digital object identifier (DOI). Fig. 1 provides an overview of how these data sources are used to assemble the dataset presented in our work.

Normalizing researcher profiles
The people table contains 778 367 researchers, uniquely identified by person IDs. We clean this table by ignoring (1) researchers without a first name or last name; (2) researchers who have the same name, institution, and major research area but different IDs as they are likely duplicates; and (3) researchers whose first, middle, or last name contain characters that are not likely to appear in a name, such as "&" and ";". These steps leave us with 774 733 (99.5%) researchers. Besides person IDs that are used internally in AFT, there are about 1 600 researchers whose Open Researcher and Contributor ID (ORCID), a persistent identifier to uniquely identify authors 24 , are available. Although this is a small fraction (0.2%), we use this information for later validation of our methods. This ORCID information needs cleaning before using it as it contains various "orcid.org" prefixes ("https://orcid.org/", "http://orcid.org/", and "orcid.org/") and wrong format, which are manually corrected.
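The name and ORCID cleaning rules above can be illustrated with a minimal sketch. It assumes that ORCID values arrive as free text and that a value is usable only if it reduces to the standard four-group form. The function names are hypothetical, and the sketch only flags malformed values (which the text says were corrected manually) rather than repairing them.

```python
import re

# ORCID iDs have four hyphen-separated groups of four characters;
# the final character may be the check character "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
# Prefix variants named in the text.
PREFIXES = ("https://orcid.org/", "http://orcid.org/", "orcid.org/")
# Characters unlikely to appear in a name; "&" and ";" are the examples in the text.
BAD_NAME_CHARS = set("&;")


def normalize_orcid(raw):
    """Return the bare ORCID iD, or None if the value is malformed."""
    value = raw.strip()
    for prefix in PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    value = value.upper()
    return value if ORCID_PATTERN.match(value) else None


def has_valid_name(first, middle, last):
    """Cleaning rules (1) and (3): first and last name present, no odd characters."""
    if not first or not last:
        return False
    parts = (first, middle or "", last)
    return not any(ch in BAD_NAME_CHARS for part in parts for ch in part)
```

Values for which `normalize_orcid` returns `None` would be the ones set aside for manual correction.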

Extracting mentor-mentee pairs
From the connect table, we filter out mentorship pairs where mentee's person ID or mentor's person ID are not present in the curated list of researchers generated in the previous section. We then drop duplicate records and ignore records where the same relationship ID corresponds to a different mentee or mentor's ID. We obtain 743 176 mentorship pairs among 738 989 researchers.
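As a rough illustration, the three filtering steps could look as follows. The dictionary keys `cid`, `mentee`, and `mentor` are hypothetical stand-ins for the columns of the connect table. Treating a conflicting relationship ID as grounds for dropping every record that carries it is an assumption about how "ignore" is applied.

```python
def extract_pairs(connect_rows, valid_pids):
    """Filter mentorship records; rows are dicts with hypothetical keys."""
    # Step 1: keep rows whose mentee and mentor both survived profile cleaning.
    kept = [
        r for r in connect_rows
        if r["mentee"] in valid_pids and r["mentor"] in valid_pids
    ]
    # Step 2: drop exact duplicate records while preserving order.
    seen = set()
    unique = []
    for r in kept:
        key = (r["cid"], r["mentee"], r["mentor"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Step 3: ignore relationship IDs that map to more than one (mentee, mentor).
    pairs_by_cid = {}
    for r in unique:
        pairs_by_cid.setdefault(r["cid"], set()).add((r["mentee"], r["mentor"]))
    return [r for r in unique if len(pairs_by_cid[r["cid"]]) == 1]
```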

Matching institutions between AFT and MAG
To facilitate matching AFT researchers with MAG authors, we first match institutions. To do so, we generate a list of rules to normalize AFT institution names iteratively. More specifically, we perform a greedy matching where we sequentially select the unmatched AFT institution with the largest number of researchers associated with it. We then apply several rules to normalize the name so that we can find it in the MAG institution list (see Table 2 for the rules). For institutions that cannot be matched using these rules, we manually search them in the MAG if they have at least 200 researchers and discard the remaining institutions. These steps are iterated until no more matches are possible.
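The greedy pass can be sketched as below. The three rules in `RULES` are hypothetical placeholders, since the actual rules are those of Table 2 and are not reproduced in this sketch. The function name and the case-insensitive lookup are likewise assumptions. Institutions are visited in descending order of researcher count, and each rule is applied in turn until the name is found in the MAG list.

```python
from collections import Counter

# Hypothetical normalization rules standing in for Table 2.
RULES = [
    lambda s: s.strip(),
    lambda s: s[4:] if s.lower().startswith("the ") else s,
    lambda s: s.replace("Univ.", "University"),
]


def match_institutions(researcher_institutions, mag_names):
    """Greedily match AFT institution names to MAG names, most populous first."""
    mag_lookup = {name.lower(): name for name in mag_names}
    counts = Counter(researcher_institutions)
    matched, unmatched = {}, []
    # most_common() yields institutions in descending order of researcher count.
    for aft_name, _ in counts.most_common():
        candidate = aft_name
        hit = mag_lookup.get(candidate.lower())
        for rule in RULES:
            if hit is not None:
                break
            candidate = rule(candidate)
            hit = mag_lookup.get(candidate.lower())
        if hit is not None:
            matched[aft_name] = hit
        else:
            unmatched.append(aft_name)  # candidates for manual lookup
    return matched, unmatched
```

Names left in `unmatched` correspond to the institutions that are either searched manually or discarded, depending on their researcher count.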

Linking AFT researchers to MAG authors
As described before, one unique feature of our dataset is that we provide lists of publications authored by AFT researchers. One motivation behind this is to access the entire co-authorship network of researchers and potentially understand the topics, venues, and citation dynamics of this network. While AFT already has publication information, it is limited to PubMed only. By matching to MAG, we can access all research areas that are not limited to biomedicine.
There are two main strategies we follow to find matches. One approach is to find, for each mentor-mentee pair, the list of MAG papers where both of their names appear as co-authors. The other strategy is to match AFT researchers using their names and affiliation information. This second strategy is necessary because some mentees have not published a paper with a mentor yet.
We first elaborate on the first strategy: matching by co-authorship. This strategy involves the following three steps:
1. First, we prepare a list of mentor-mentee name pairs. To do so, for each AFT researcher, we consider her full name as presented in the AFT. If the first name has more than one character (i.e., not first initial), we also consider two possible variations: (1) first name, middle initial, last name; and (2) first name and last name. For a mentor-mentee pair, we then enumerate all possible name pairs.
2. Second, we scan the MAG to collect papers where the name pair of two co-authors appear in the list of name pairs prepared in the first step. Specifically, for a MAG paper, we collect its co-author names from the PaperAuthorAffiliations and Authors tables. Then, we use the nameparser Python library 25 to parse a full name into first, middle, and last name. (Author names in the MAG are given as single text.) Next, we consider all possible name pairs of two co-authors and check if each pair is presented in the list of AFT name pairs prepared in the first step. Note that we only consider conference papers, journal articles, and unknown when performing the matching, ignoring the other five types of documents presented in MAG: book chapter, book, dataset, patent, and repository.
3. After scanning the MAG, we obtain a list of associated papers and the MAG author IDs for the mentor and the mentee for each mentor-mentee pair. In total, 359 238 AFT researchers have MAG papers associated with them and have at least one corresponding MAG author ID. Among these researchers, 295 630 (82.3%) have only one MAG author ID. For the rest, although multiple MAG author IDs are associated with them, a single ID accounts for more than half of the published works for the vast majority of those researchers. Therefore, we assign the most common MAG author ID to an AFT researcher if there is a single majority (98% of cases). We drop the remaining 2%, which results in a total of 353 377 AFT researchers linked to MAG using co-authorship-based matching.
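Two pieces of the procedure above lend themselves to a short sketch: enumerating name variations and name pairs (step 1) and assigning the majority MAG author ID (step 3). The function names are hypothetical, and names are assumed to be already split into first, middle, and last parts. The scan over MAG itself is not shown.

```python
from collections import Counter
from itertools import product


def name_variants(first, middle, last):
    """Full name plus the two shortened forms described in step 1."""
    full = " ".join(part for part in (first, middle, last) if part)
    variants = {full}
    # Variations are added only when the first name is not a bare initial.
    if len(first) > 1:
        if middle:
            variants.add(f"{first} {middle[0]} {last}")
        variants.add(f"{first} {last}")
    return variants


def name_pairs(mentor, mentee):
    """All (mentor name, mentee name) combinations for one mentorship pair."""
    return set(product(name_variants(*mentor), name_variants(*mentee)))


def majority_author_id(author_ids):
    """Return the MAG author ID covering more than half of the papers, else None."""
    if not author_ids:
        return None
    top_id, top_count = Counter(author_ids).most_common(1)[0]
    return top_id if top_count * 2 > len(author_ids) else None
```

Here `author_ids` holds one MAG author ID per matched paper, so a researcher for whom `majority_author_id` returns `None` would fall into the dropped 2%.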
Next, we match the remaining 421 356 unmatched researchers with MAG using their name and institution information. The procedure is similar to co-authorship-based matching. First, we collect, for an AFT researcher, all possible name-institution pairs by considering her name variations and the institutions presented in the profile and mentorship tables (Fig. 1). We then aggregate those pairs across all researchers. Note that name-institution pairs are not unique for only 928 (0.2%) unmatched researchers. Next, we scan the MAG to find papers where the co-authors' name-institution pairs are in the prepared list of name-institution pairs. In this way, we additionally match 141 078 researchers, bringing the total number of matched researchers to 494 455 (63.8%).

Estimating semantic representations
Our efforts so far have yielded a list of papers for each AFT researcher who we can match in MAG. Next, we use the titles and abstracts of these papers to construct vector representations of the researcher. Such models can capture semantics, allowing us to apply them in a wide range of scenarios such as comparing the content between researchers 8 , recommendation 26 , and matchmaking of scientists 27 . Here we provide two types of representations; one is based on standard term frequency-inverse document frequency (TF-IDF) vectors, and the other is based on modern deep learning embeddings.
TF-IDF representation: The subset of researchers who we can match in MAG published a total of 16 942 415 papers in MAG. We concatenate the titles and abstracts of these papers. Then using scikit-learn 28 , we preprocess the concatenated text by removing English stop words as well as words appearing only once and apply the TF-IDF transformation. This preprocessing results in a 16 942 415 × 2 275 293 sparse matrix, with each row corresponding to a paper and each column a term. The vector of a researcher is the centroid (average) of the TF-IDF vectors of her documents.
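A toy version of this pipeline is sketched below with scikit-learn. The four-paper corpus and the `papers_by_researcher` mapping are invented for illustration. Using `min_df=2` is an assumption that only approximates "words appearing only once", because it drops terms found in fewer than two papers rather than terms with a single occurrence overall.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one string per paper (title and abstract already concatenated).
papers = [
    "neural coding in the visual cortex",
    "visual cortex plasticity after learning",
    "graph algorithms for citation networks",
    "citation networks and neural coding",
]
# Hypothetical mapping from researcher ID to the row indices of her papers.
papers_by_researcher = {"r1": [0, 1], "r2": [2, 3]}

# stop_words="english" removes English stop words; min_df=2 drops rare terms.
vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
tfidf = vectorizer.fit_transform(papers)  # sparse matrix: papers x terms

# A researcher's vector is the centroid (mean) of her papers' TF-IDF rows.
researcher_vectors = {
    rid: np.asarray(tfidf[rows].mean(axis=0)).ravel()
    for rid, rows in papers_by_researcher.items()
}
```

At the scale reported above, the per-researcher averaging would more likely be done with a sparse researcher-by-paper indicator matrix than with a Python loop, but the arithmetic is the same.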
Deep learning embedding: We employ SPECTER 29 , a representation learning algorithm for scientific documents, to obtain dense vector representations of papers. We concatenate titles and abstracts and use the implementation reported in 30 . Each article is represented by a dense vector of 768 dimensions, resulting in a dense 16 942 415 × 768 matrix for all documents. The vector of a researcher, again, is the average of the vectors of her papers.

Estimating gender and race/ethnicity
Gender in science has become an important subject of study 20. Here we provide researchers' gender information inferred from their first names. To do so, we encode the character sequence using both the full string and sub-word tokenization as created by a pre-trained BERT model 31,32. The output of the BERT model is passed through a pooling layer, which creates a vector of 768 elements. This vector is then passed through a dropout layer and a softmax layer to produce the final gender predictions. We have three genders in our dataset: two legal labels (female and male) and one unknown label, which attempts to capture potentially non-binary genders. For the training data, we use a combination of datasets. One dataset provides the predicted gender of author names in the Author-ity 2009 dataset using the Genni and SexMac tools 33. We only keep data points where Genni and SexMac agree with each other. This filtering step leaves us with 2 793 982 labeled data points. Another dataset for training comes from the Social Security Administration (SSA) and is about popular newborn names and their gender 34. The SSA dataset contains 95 026 names labeled as "male" and "female". To reduce the generalization error, we sample each class from the aggregated dataset and obtain a relatively balanced dataset with 1 500 000 data points (male: 600 000, female: 600 000, unknown: 300 000). When training, we sample each of the three labels equally. We use 80% for training and 20% for validation. The classes in both splits are also balanced.
We also provide race/ethnicity information of researchers inferred from their full name using a similar architecture. The deep learning architecture is identical to the one used in the gender prediction above: BERT → Max Pooling → Dropout → Softmax. We combine two data sources as our training set. The first one contains the predicted ethnicity of authors in the Author-ity 2009 dataset using the Ethnea tool 35. We map the predicted categories into four groups: Asian, Hispanic, Black, and White using the mapping described in Table 3. The second dataset consists of name and ethnicity information extracted from personal profiles on Wikipedia 36. We map the Wikipedia labels into the same four categories of ethnicity listed before. Finally, we get a dataset with 720 000 data points (Black: 180 000, Asian: 180 000, Hispanic: 180 000, White: 180 000). The training and validation schedule is similar to the one followed for the gender prediction.
Both models are incorporated in our Python package demographicx 37 .

Data Records
The resulting dataset 38 has 9 main tables, shared as the files described below. Fig. 2 presents the entity-relationship diagram of these tables.
1. researcher.csv is a comma-separated values (CSV) file listing 774 733 researchers and contains the following variables: person ID (PID), first name, middle name, last name, institution, institution MAG ID, research area, ORCID, and MAG author ID. We also provide an auxiliary file named first_name_gender.csv that maps first name to inferred gender and an auxiliary file called full_name_race.csv that maps full name to inferred race/ethnicity.

Validation of gender and ethnicity estimation
We report in Table 4 the performance of our gender prediction algorithm on the validation set and the SSA set. To validate the "unknown" class, we used "unknown" labels from Author-ity for names in the SSA dataset labeled "unknown" in the Author-ity dataset. For both sets, our algorithm performs well for all three categories. Applying the algorithm to our dataset, Table 5 presents the numbers of researchers by gender. Similarly, we test our race/ethnicity prediction algorithm on the validation set and the Wikipedia dataset, obtaining good performance for all four groups (Table 6). Table 7 presents the number of researchers by predicted race/ethnicity using our algorithm. Even though the model achieves good performance, we found that African American names are underrepresented in the training dataset. Since the majority of Black names are from outside the U.S., the model made predictions largely based on information about African names outside of the U.S. and might suffer from poor performance when predicting African American names. Due to the sensitive nature of names and ethnicity, it is hard to find full names of African American individuals. However, we retrieved 340 names from the Black In Neuro website 39 to estimate the extent of the issue. The average probability of predicting a name to be Black was 19.5%, with many names being classified as White. While the set of names retrieved from Black In Neuro is small and might introduce selection bias, the validation suggests that the ethnicity predictions are poor for African American names. To improve upon this performance, we created a second model that uses only the surnames reported in the U.S. Census 40. The performance of this second model was significantly better on the Black In Neuro dataset (30%). The validation on the U.S. Census reveals that this model has worse performance than the first model above (validation data: Black F1: 0.53, Asian F1: 0.64, Hispanic F1: 0.692, White F1: 0.52). We leave it to the user to determine which of the two models better serves their analysis.

Validation of mentorship
Our dataset covers mentorship relationships in multiple disciplines. Table 8 presents the top 20 most represented areas. Neuroscience has the largest number of researchers, given that AFT was originally aimed at academic genealogy in neuroscience. Social science fields, such as education, literature, sociology, and economics, are also well represented. Table 9 gives the count of each type of mentorship. Table 8 indicates that we can match the majority of researchers in the natural sciences, but for social science fields such as education and literature, we have lower percentages of researchers matched.

Validation of linking AFT researchers with MAG authors
To validate our linking of AFT researchers to MAG authors, we take advantage of the fact that, for some AFT researchers, a set of publications is known to be genuinely authored by them. We examine whether these publications also appear in the publication list of the corresponding matched MAG author. Here we focus on two subsets of AFT researchers: (1) those with papers verified by AFT website users; and (2) those with ORCID available.
Let us describe the first subset. In our previous works 8, 21 , we have automatically linked AFT researchers to publications indexed in PubMed. Those matched papers are then displayed on researchers' profile pages. AFT website users who have signed into the website can label whether the authorship is correct. We consider these labeled papers as a validation set to test the performance of our AFT-to-MAG matching of authors. To match these papers to MAG, we rely on their DOIs. For papers without DOI but with PMID, we query PubMed to get their DOI 41 .
We can now introduce the measure used to quantify the performance of our matching. Let $a$ be an AFT researcher who has at least one verified paper and $P_a$ the list of her verified papers. Let also $\hat{a}$ be the corresponding matched MAG author and $P_{\hat{a}}$ the list of papers found on MAG. We calculate the fraction of $P_a$ that appear in $P_{\hat{a}}$, formally:
$$O_a = \frac{|P_a \cap P_{\hat{a}}|}{|P_a|}.$$
Fig. 4A, which plots the histogram of $O_a$ for the first subset of researchers, indicates the validity of our matching process; for the vast majority of researchers, we can find most of their verified papers in the publication lists of their matched MAG authors.
Let us describe the second subset, for which $P_a$ is the list of papers on the ORCID website. To get these papers, we download the 2019 ORCID Public Data File (the most recent one) 42, extract documents authored by researchers, and match extracted papers to MAG using their DOI. Fig. 4B shows the histogram of $O_a$ for the second subset of researchers, indicating that most of their papers also appear in the publication lists of the corresponding matched MAG authors.
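For concreteness, the overlap measure (the fraction of a researcher's verified papers that also appear in the publication list of her matched MAG author) can be computed as in the sketch below. The function name is hypothetical, papers are assumed to be identified by DOI, and DOIs are compared case-insensitively because DOI names are case-insensitive.

```python
def overlap_fraction(verified_dois, mag_dois):
    """Fraction of verified papers that also appear in the matched MAG author's list."""
    verified = {d.strip().lower() for d in verified_dois}
    if not verified:
        raise ValueError("researcher needs at least one verified paper")
    found = {d.strip().lower() for d in mag_dois}
    return len(verified & found) / len(verified)
```

A value of 1.0 means every verified paper was found under the matched MAG author, and a value of 0.0 means none was.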

Validation of author vector
We validate researchers' vectors by comparing distances between researchers who belong to different groups. Specifically, in Fig. 5A, we show that the cosine distance of the TF-IDF vectors of a particular Ph.D. mentee, a, and her mentor, b, is much smaller than the distances between a and randomly selected researchers. Generalizing this systematically, for each Ph.D. mentee, we obtain a triplet (a, b, c) where c is a randomly chosen researcher. We then calculate the difference of the distance between a and c, d(a, c), and the distance between a and b, d(a, b). As we expect, the semantics of a mentee is more similar to her Ph.D. mentor than to a random researcher, and the distance difference is expected to be larger than 0. This pattern is indeed the case for the vast majority (97.4%) of Ph.D. mentees (Fig. 5B). We also replicate these analyses using SPECTER vectors, and the results remain similar (Figs. 5C-D): For 98.4% of Ph.D. mentees, they are semantically closer to their Ph.D. mentors than randomly selected researchers (Fig. 5D). The threshold 0 is located at 1.66 and 2.39 standard deviations away from the mean for the TF-IDF case and SPECTER case, respectively, suggesting that SPECTER is a better representation method.
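The triplet comparison can be sketched in a few lines of standard-library Python. The function names, the dictionary inputs, and the choice to exclude both the mentee and her mentor from the random draw are assumptions made for illustration. Vectors are assumed to be non-zero.

```python
import math
import random


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_differences(vectors, mentor_of, seed=0):
    """For each mentee a with mentor b, return d(a, c) - d(a, b) for a random c."""
    rng = random.Random(seed)
    ids = sorted(vectors)
    diffs = {}
    for a, b in mentor_of.items():
        # Draw the comparison researcher c from everyone except a and b.
        c = rng.choice([r for r in ids if r not in (a, b)])
        d_ac = cosine_distance(vectors[a], vectors[c])
        d_ab = cosine_distance(vectors[a], vectors[b])
        diffs[a] = d_ac - d_ab
    return diffs
```

The share of mentees with a positive difference is then `sum(d > 0 for d in diffs.values()) / len(diffs)`, which is the kind of quantity reported as a percentage above.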
To further show the structure of researchers' SPECTER vectors, we run the UMAP 43 dimension reduction technique to obtain 2-dimensional vectors and display them as a scatter plot for a 20% random sample of researchers in Fig. 6. As expected, researchers in the same research area are clustered, meaning that they are semantically closer to each other than researchers from other areas.

Usage Notes
Users can integrate our dataset with MAG to study the role of mentors in mentees' academic careers. MAG provides detailed information about papers and citations, from which users can derive various indicators commonly used in the science of science. Users can access MAG data by following the steps outlined on its website 44. In addition to MAG IDs, the other publication identifiers we provide facilitate integration with other scholarly databases. In particular, users can use the CrossRef API to retrieve metadata of papers using DOI 45. Users can also use the E-utilities API provided by the National Library of Medicine to obtain metadata of PubMed articles using PMID 41.
Users who want to use our released researcher vectors to perform semantic analysis can load the TF-IDF vector file using the SciPy library's scipy.sparse.load_npz function.
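A minimal loading sketch follows. Because the name of the released file is not stated here, the example first writes a tiny stand-in matrix to a temporary file; the file name `tfidf_demo.npz` and the matrix values are placeholders.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Stand-in for the released file: a tiny 2 x 3 sparse matrix.
demo = sparse.csr_matrix(np.array([[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tfidf_demo.npz")  # placeholder file name
    sparse.save_npz(path, demo)
    # load_npz restores the sparse matrix in the format it was saved in.
    vectors = sparse.load_npz(path)

# Rows can be densified one at a time to avoid materializing the full matrix.
row0 = vectors[0].toarray().ravel()
```

With the real file, only the `sparse.load_npz(path)` call is needed, and keeping the matrix sparse is advisable given its size.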

Table 5. Number of researchers by gender. Here, the gender of the researcher is estimated by an algorithm using their first name. We acknowledge that there could be a great deal of noise and bias in this estimation. However, we believe it is better to open our algorithm to the community instead of analyzing proprietary software that does not publicize data used and performance metrics.
Gender    # researchers
male      374199
female    264263
unk       135732
Table 9. Mentorship type definition and statistics.
Research scientist 7402
4 Collaborator 17833